STUDENT LOANS CHALLENGE

COURSERA ML CHALLENGE


This notebook was created to document the steps taken to solve the Predict Students’ Ability to Repay Educational Loans posted on the Data Science Community in Coursera.

The data is aviable at:

https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv.

Documentation for the data is available at https://collegescorecard.ed.gov/data/documentation/. There is a data dictionary at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx.

WORKFLOW

The Workflow suggested in https://www.kaggle.com/startupsci/titanic-data-science-solutions is going to be followed. The Workflow is the following:

    Question or problem definition.
    Acquire training and testing data.
    Wrangle, prepare, cleanse the data.
    Analyze, identify patterns, and explore the data.
    Model, predict and solve the problem.
    Visualize, report, and present the problem solving steps and final solution.
    Supply the results.

The workflow indicates general sequence of how each stage may follow the other. However, there are use cases with exceptions:

    We may combine mulitple workflow stages. We may analyze by visualizing data.
    Perform a stage earlier than indicated. We may analyze data before and after wrangling.
    Perform a stage multiple times in our workflow. Visualize stage may be used multiple times.
    Drop a stage altogether. We may not need supply stage to productize or service enable our dataset for a competition.

Problem Definition

Test to see if a set of institutional features can be used to predict student otucomes, in particular debt repayment. This solution is intended to try to explore to what extent instututional characteristics as well as certain demographic factors can indicate or predict debt repayment.

The (US) “College Scorecard” (the data set) includes national data on the earnings of former college graduates and new data on student debt.

Import Libraries

First import the libraries that are going to be used:


In [2]:
# data analysis and manipulation
import numpy as np
import pandas as pd
np.set_printoptions(threshold=1000)

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

#machine learning
import tensorflow as tf

#Regular expression
import re

Acqure Data

The data is acquired using pandas (I renamed the file to CollegeScorecardData.csv)


In [3]:
all_data = pd.read_csv('datasets/CollegeScorecardData.csv')


C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (6,9,1170,1171,1172,1173,1174,1175,1177,1178,1179,1181,1183,1184,1185,1186,1187,1188,1190,1192,1196,1198,1199,1200,1201,1209,1211,1212,1213,1214,1222,1223,1224,1225,1226,1227,1235,1236,1237,1238,1239,1240,1248,1249,1251,1252,1253,1257,1261,1262,1264,1265,1266,1270,1274,1275,1276,1277,1278,1279,1282,1283,1287,1288,1289,1290,1291,1292,1295,1296,1309,1322,1326,1327,1328,1329,1330,1331,1333,1334,1335,1339,1340,1341,1342,1343,1344,1346,1347,1348,1379,1380,1381,1382,1383,1384,1385,1386,1387,1388,1389,1390,1391,1392,1393,1394,1395,1396,1397,1398,1399,1400,1401,1402,1403,1404,1405,1406,1407,1411,1425,1427,1489,1490,1491,1492,1493,1494,1495,1496,1497,1498,1499,1500,1501,1502,1503,1517,1529,1530,1532,1537,1540,1541,1542,1546,1575,1576,1577,1578,1579,1580,1581,1582,1583,1584,1585,1586,1587,1588,1589,1590,1591,1592,1593,1594,1595,1596,1597,1598,1599,1600,1601,1602,1603,1605,1606,1608,1609,1610,1613,1614,1615,1619,1620,1621,1622,1623,1624,1625,1626,1627,1628,1629,1636,1637,1638,1639,1640,1641,1642,1643,1644,1645,1646,1647,1648,1649,1650,1651,1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682,1683,1684,1685,1686,1687,1688,1689,1690,1691,1692,1693,1694,1695,1696,1697,1698,1699,1700,1701,1702,1703,1704,1705,1706,1707,1725,1726,1727,1728,1729) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Analyze Data

First, let's see a little bit of the data


In [4]:
all_data.head()


Out[4]:
UNITID OPEID OPEID6 INSTNM CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL ... D100_L4 TRANS_4 DTRANS_4 TRANS_L4 DTRANS_L4 ICLEVEL UGDS_MEN UGDS_WOMEN CDR3_DENOM CDR2_DENOM
0 100654 100200 1002 Alabama A & M University Normal AL 35762 Southern Association of Colleges and Schools C... www.aamu.edu/ galileo.aamu.edu/netpricecalculator/npcalc.htm ... NaN 0.0000 1044.0 NaN NaN 1 0.4831 0.5169 1895.0 1574.0
1 100663 105200 1052 University of Alabama at Birmingham Birmingham AL 35294-0110 Southern Association of Colleges and Schools C... www.uab.edu www.collegeportraits.org/AL/UAB/estimator/agree ... NaN 0.2408 1246.0 NaN NaN 1 0.4169 0.5831 3937.0 3481.0
2 100690 2503400 25034 Amridge University Montgomery AL 36117-3553 Southern Association of Colleges and Schools C... www.amridgeuniversity.edu www2.amridgeuniversity.edu:9091/ ... NaN 0.0000 9.0 NaN NaN 1 0.3986 0.6014 344.0 336.0
3 100706 105500 1055 University of Alabama in Huntsville Huntsville AL 35899 Southern Association of Colleges and Schools C... www.uah.edu finaid.uah.edu/ ... NaN 0.3477 765.0 NaN NaN 1 0.5733 0.4267 1513.0 1392.0
4 100724 100500 1005 Alabama State University Montgomery AL 36104-0271 Southern Association of Colleges and Schools C... www.alasu.edu www.alasu.edu/cost-aid/forms/calculator/index.... ... NaN 0.0000 1330.0 NaN NaN 1 0.3877 0.6123 2137.0 1961.0

5 rows × 1743 columns

Find information about the features

Let's find more about the data


In [5]:
all_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1743 entries, UNITID to CDR2_DENOM
dtypes: float64(443), int64(13), object(1287)
memory usage: 102.4+ MB

There are 7703 examples and 1743 features.

There are 443 float features that may be numeric, 13 integer features that may be categorical, and 1287 features that are strings, but may be numbers but data was not entered correctly (for example, if there was not data for a given feature, someone could have written "blank"). Given the high number of non numerical features, we need to explore them more. Luckly, there is a dictionary provided with the data, so we can explore it a little bit to learn about the data (The original file was converted do CSV)


In [6]:
data_dict = pd.read_csv('datasets/CollegeScorecardDataDictionary.csv')

In [7]:
data_dict.head()


Out[7]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
0 Unit ID for institution root id integer UNITID NaN NaN IPEDS Shown/used on consumer website.
1 8-digit OPE ID for institution root ope8_id integer OPEID NaN NaN IPEDS Shown/used on consumer website.
2 6-digit OPE ID for institution root ope6_id integer OPEID6 NaN NaN IPEDS Shown/used on consumer website.
3 Institution name school name autocomplete INSTNM NaN NaN IPEDS Shown/used on consumer website.
4 City school city autocomplete CITY NaN NaN IPEDS Shown/used on consumer website.

In [8]:
data_dict.tail()


Out[8]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
1970 NaN school NaN NaN NaN 3.0 Less-than-2-year NaN NaN
1971 Total share of enrollment of undergraduate deg... student demographics.men float UGDS_MEN NaN NaN IPEDS NaN
1972 Total share of enrollment of undergraduate deg... student demographics.women float UGDS_WOMEN NaN NaN IPEDS NaN
1973 Number of students in the cohort for the two-y... repayment 2_yr_default_rate_denom integer CDR2_DENOM NaN NaN FSA NaN
1974 Number of students in the cohort for the three... repayment 3_yr_default_rate_denom integer CDR3_DENOM NaN NaN FSA NaN

In [9]:
data_dict.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1975 entries, 0 to 1974
Data columns (total 9 columns):
NAME OF DATA ELEMENT       1734 non-null object
dev-category               1865 non-null object
developer-friendly name    1734 non-null object
API data type              1734 non-null object
VARIABLE NAME              1734 non-null object
VALUE                      268 non-null float64
LABEL                      306 non-null object
SOURCE                     1734 non-null object
NOTES                      274 non-null object
dtypes: float64(1), object(8)
memory usage: 138.9+ KB

There are 1975 entries, but the column NAME OF DATA ELEMENT has only 1734 not nut elements, so something is up. Let's try to explore the dict a little bit more


In [10]:
data_dict[5:10]


Out[10]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
5 State postcode school state string STABBR NaN NaN IPEDS Shown/used on consumer website.
6 ZIP code school zip integer ZIP NaN NaN IPEDS NaN
7 Accreditor for institution school accreditor string ACCREDAGENCY NaN NaN FSA NaN
8 URL for institution's homepage school school_url string INSTURL NaN NaN IPEDS Shown/used on consumer website.
9 URL for institution's net price calculator school price_calculator_url string NPCURL NaN NaN IPEDS Shown/used on consumer website.

Nothing suspicius here, lets try again


In [11]:
data_dict[10:20]


Out[11]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
10 Predominant degree awarded (recoded 0s and 4s) school degrees_awarded.predominant_recoded integer SCH_DEG NaN NaN IPEDS/NSLDS Missing values, 0s, and 4s from PREDDEG recode...
11 Schools that are on Heightened Cash Monitoring... school under_investigation integer HCM2 NaN NaN FSA Shown/used on consumer website; Flag (1=HCM2)
12 Flag for main campus school main_campus integer MAIN 0.0 Not main campus IPEDS NaN
13 NaN school NaN NaN NaN 1.0 Main campus NaN NaN
14 Number of branch campuses school branches integer NUMBRANCH NaN NaN IPEDS NaN
15 Predominant undergraduate degree awarded\n 0 N... school degrees_awarded.predominant integer PREDDEG 0.0 Not classified IPEDS Shown/used on consumer website.
16 NaN school NaN NaN NaN 1.0 Predominantly certificate-degree granting NaN NaN
17 NaN school NaN NaN NaN 2.0 Predominantly associate's-degree granting NaN NaN
18 NaN school NaN NaN NaN 3.0 Predominantly bachelor's-degree granting NaN NaN
19 NaN school NaN NaN NaN 4.0 Entirely graduate-degree granting NaN NaN

Aha! It seems that the feature at index 15 is categorical, and that's why the rows that follow it don't have a value under NAME OF DATA ELEMENT. Just for now, let's get rid of those NAN rows.


In [12]:
data_dict_no_nan_names = data_dict.dropna(subset=['NAME OF DATA ELEMENT'])
data_dict_no_nan_names[10:20]


Out[12]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
10 Predominant degree awarded (recoded 0s and 4s) school degrees_awarded.predominant_recoded integer SCH_DEG NaN NaN IPEDS/NSLDS Missing values, 0s, and 4s from PREDDEG recode...
11 Schools that are on Heightened Cash Monitoring... school under_investigation integer HCM2 NaN NaN FSA Shown/used on consumer website; Flag (1=HCM2)
12 Flag for main campus school main_campus integer MAIN 0.0 Not main campus IPEDS NaN
14 Number of branch campuses school branches integer NUMBRANCH NaN NaN IPEDS NaN
15 Predominant undergraduate degree awarded\n 0 N... school degrees_awarded.predominant integer PREDDEG 0.0 Not classified IPEDS Shown/used on consumer website.
20 Highest degree awarded\n 0 Non-degree-granting... school degrees_awarded.highest integer HIGHDEG 0.0 Non-degree-granting IPEDS Shown/used on consumer website.
25 Control of institution school ownership integer CONTROL 1.0 Public IPEDS Shown/used on consumer website.
28 FIPS code for state school state_fips integer ST_FIPS 1.0 Alabama IPEDS NaN
86 Region (IPEDS) school region_id integer REGION 0.0 U.S. Service Schools IPEDS NaN
96 Locale of institution school locale integer LOCALE 11.0 City: Large (population of 250,000 or more) IPEDS Shown/used on consumer website.

Lets get the info of the new dict


In [13]:
data_dict_no_nan_names.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1734 entries, 0 to 1974
Data columns (total 9 columns):
NAME OF DATA ELEMENT       1734 non-null object
dev-category               1734 non-null object
developer-friendly name    1734 non-null object
API data type              1734 non-null object
VARIABLE NAME              1734 non-null object
VALUE                      27 non-null float64
LABEL                      65 non-null object
SOURCE                     1734 non-null object
NOTES                      274 non-null object
dtypes: float64(1), object(8)
memory usage: 135.5+ KB

We are interested primarly in the NAME OF DATA ELEMENT, VARIABLE NAME and API data type. They seem complete. Let's see howe many data types there are


In [14]:
data_dict_no_nan_names['API data type'].unique()


Out[14]:
array(['integer', 'autocomplete', 'string', 'float'], dtype=object)

Let's find out how many features have each data type


In [15]:
data_dict_no_nan_names['API data type'].value_counts()


Out[15]:
float           1206
integer          521
string             4
autocomplete       3
Name: API data type, dtype: int64

So in reality, there are 1206 float features, 521 integers, and 7 string features. (For now we assume that the autocomplete type is string). This numbers differ a lot from our previus analisys, in which we had 443 float features, 13 integer features and 1287 features that are strings.

Also, we cannot asume that all features of type integer are categorical, for example the ZIP code feature is integer but is not a categorical feature.

Let's find more about the autocomplete features:


In [16]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']


Out[16]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
3 Institution name school name autocomplete INSTNM NaN NaN IPEDS Shown/used on consumer website.
4 City school city autocomplete CITY NaN NaN IPEDS Shown/used on consumer website.
1959 Institution name aliases school alias autocomplete ALIAS NaN NaN IPEDS NaN

We can see that these autocomplete features can be treated as strings.

Delete features that have all their values NaN


In [17]:
all_data_no_na_columns = all_data.dropna(axis=1, how='all')

Delete features that are meaningless

There are features that are meaningless for the problem we are trying to solve. We need to drop these features, but we need a criterion to eliminate them. The criterion that we are going to employ is to eliminate the features that are unique for every entry and don't add information to the problem, for example if we have a unique ID for every institution, this ID doesn't add information to the problem.

Also, we need to take in account that there area features that may be unique for every entry, but DOES add relevant information. For example, the tuition fees may be unique and add information.

Let's find the ratio of the number of unique values over number of examples:


In [18]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []

#Calculate the ratio
for feature in all_data_no_na_columns.columns.values:    
    # Get the row in the dict wich have VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    
    #Check if exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # float features (numeric features) are not taken in account
        if data_type == 'integer' or data_type == 'string' or data_type ==  'autocomplete':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.8:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
    #The feature is not in the dict
    else:
        features_not_in_dict.append(feature)

print ("\nFeatures in data but not in the dictionary:" + str(features_not_in_dict))


UNITID: 1.0
OPEID: 0.9962352330260937
INSTNM: 0.9781903154615085
ZIP: 0.8468129300272621
INSTURL: 0.8054025701547338
NPT4_PUB: 0.9419152276295133
NPT4_PRIV: 0.9127559726962458
NPT4_PROG: 0.8748801534036433
NPT4_OTHER: 1.0
NPT41_PUB: 0.9344520188778186
NPT42_PUB: 0.929042904290429
NPT43_PUB: 0.9318568994889267
NPT44_PUB: 0.9426386233269598
NPT45_PUB: 0.9683397683397683
NPT41_PRIV: 0.897950377562028
NPT42_PRIV: 0.9219765929778934
NPT43_PRIV: 0.9270772806507844
NPT44_PRIV: 0.9314720812182741
NPT45_PRIV: 0.9492204899777282
NPT41_PROG: 0.8815915627996165
NPT42_PROG: 0.8791946308724832
NPT43_PROG: 0.87535953978907
NPT44_PROG: 0.8815915627996165
NPT45_PROG: 0.8796740172579098
NPT41_OTHER: 1.0
NPT42_OTHER: 1.0
NPT43_OTHER: 1.0
NPT44_OTHER: 1.0
NPT45_OTHER: 1.0
NPT4_048_PUB: 0.9351125065410779
NPT4_048_PRIV: 0.9043739279588336
NPT4_048_PROG: 0.9616122840690979
NPT4_048_OTHER: 1.0
NPT4_3075_PUB: 0.9302452316076294
NPT4_3075_PRIV: 0.9064928553522186
NPT4_75UP_PUB: 0.9539722572509458
NPT4_75UP_PRIV: 0.9351978171896317
NPT4_3075_PROG: 0.9706235011990407
NPT4_3075_OTHER: 0.9941520467836257
NPT4_75UP_PROG: 0.9839228295819936
NPT4_75UP_OTHER: 0.993421052631579
NUM4_OTHER: 0.9171270718232044
NUM41_OTHER: 0.8870056497175142
COSTT4_A: 0.9387096774193548
COSTT4_P: 0.9354336833916764
TUITFTE: 0.8475928473177442
MEDIAN_HH_INC: 0.8532386867790595
ALIAS: 0.9366391184573003

Features in data but not in the dictionary:['C200_4', 'C200_L4', 'D200_4', 'D200_L4', 'C200_4_POOLED', 'C200_L4_POOLED', 'POOLYRS200', 'D200_4_POOLED', 'D200_L4_POOLED']

So there are some features in the data that are not explained in the dictionary. Tha is not necessarly an inconvenience, so we won't worry abot this right now.

Lets find what those NTP4 features are about


In [19]:
npt4_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT4_PUB'
npt41_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT41_PUB'
npt42_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT42_PUB'
data_dict_no_nan_names[npt4_pub | npt41_pub | npt42_pub ]


Out[19]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
555 Average net price for Title IV institutions (p... cost avg_net_price.public integer NPT4_PUB NaN NaN IPEDS Shown/used on consumer website; The average...
559 Average net price for $0-$30,000 family income... cost net_price.public.by_income_level.0-30000 integer NPT41_PUB NaN NaN IPEDS Shown/used on consumer website; The average...
560 Average net price for $30,001-$48,000 family i... cost net_price.public.by_income_level.30001-48000 integer NPT42_PUB NaN NaN IPEDS Shown/used on consumer website; ibid

So those NTP4 features are about Average Net prices, so they are defenetly numeric features, and it makes sense to keep them.

Let's run our previous analysis again with out those features so we can have a cleaner visualization as we lower the threshold


In [20]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []

#Calculate the ratio
for feature in all_data_no_na_columns.columns.values:    
    # Get the row in the dict wich have VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    
    #Check if exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # float features (numeric features) are not taken in account
        if (data_type == 'integer' or data_type == 'string' or data_type ==  'autocomplete') \
        and feature[:4] != 'NPT4':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.5:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
print(features_with_high_ratio)


UNITID: 1.0
OPEID: 0.9962352330260937
OPEID6: 0.7350382967674932
INSTNM: 0.9781903154615085
ZIP: 0.8468129300272621
INSTURL: 0.8054025701547338
NPCURL: 0.7592903045543448
UG: 0.5519023986765922
NUM4_PUB: 0.5007849293563579
NUM4_OTHER: 0.9171270718232044
NUM41_OTHER: 0.8870056497175142
NUM42_OTHER: 0.5649717514124294
COSTT4_A: 0.9387096774193548
COSTT4_P: 0.9354336833916764
TUITIONFEE_IN: 0.6820980615735461
TUITIONFEE_OUT: 0.7219698780779344
TUITFTE: 0.8475928473177442
INEXPFTE: 0.7491059147180192
AVGFACSAL: 0.7490101187857457
D150_4_POOLED: 0.6139367240688827
DEP_INC_AVG: 0.6594637855142057
IND_INC_AVG: 0.6594637855142057
AGE_ENTRY: 0.7036147792450314
FAMINC: 0.7158863545418167
MD_FAMINC: 0.6882753101240496
FAMINC_IND: 0.6594637855142057
MEDIAN_HH_INC: 0.8532386867790595
ALIAS: 0.9366391184573003
['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'ZIP', 'INSTURL', 'NPCURL', 'UG', 'NUM4_PUB', 'NUM4_OTHER', 'NUM41_OTHER', 'NUM42_OTHER', 'COSTT4_A', 'COSTT4_P', 'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'TUITFTE', 'INEXPFTE', 'AVGFACSAL', 'D150_4_POOLED', 'DEP_INC_AVG', 'IND_INC_AVG', 'AGE_ENTRY', 'FAMINC', 'MD_FAMINC', 'FAMINC_IND', 'MEDIAN_HH_INC', 'ALIAS']

Let's see what are these features about:


In [21]:
high_ratio_features = pd.DataFrame()
for feature in features_with_high_ratio:
    high_ratio_features = high_ratio_features.append(data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature])
high_ratio_features


Out[21]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
0 Unit ID for institution root id integer UNITID NaN NaN IPEDS Shown/used on consumer website.
1 8-digit OPE ID for institution root ope8_id integer OPEID NaN NaN IPEDS Shown/used on consumer website.
2 6-digit OPE ID for institution root ope6_id integer OPEID6 NaN NaN IPEDS Shown/used on consumer website.
3 Institution name school name autocomplete INSTNM NaN NaN IPEDS Shown/used on consumer website.
6 ZIP code school zip integer ZIP NaN NaN IPEDS NaN
8 URL for institution's homepage school school_url string INSTURL NaN NaN IPEDS Shown/used on consumer website.
9 URL for institution's net price calculator school price_calculator_url string NPCURL NaN NaN IPEDS Shown/used on consumer website.
529 Enrollment of all undergraduate students student enrollment.all integer UG NaN NaN IPEDS Available in 2000-2001 datafile only
591 Number of Title IV students (public institutions) cost title_iv.public.all integer NUM4_PUB NaN NaN IPEDS The number of full-time, first-time, degree/ce...
594 Number of Title IV students (other academic ca... cost title_iv.academic_year integer NUM4_OTHER NaN NaN IPEDS ibid
610 Number of Title IV students, $0-$30,000 family... cost title_iv.other_acad_calendar.by_income_level.0... integer NUM41_OTHER NaN NaN IPEDS ibid
611 Number of Title IV students, $30,001-$48,000 f... cost title_iv.other_acad_calendar.by_income_level.3... integer NUM42_OTHER NaN NaN IPEDS ibid
615 Average cost of attendance (academic year inst... cost attendance.academic_year integer COSTT4_A NaN NaN IPEDS The average annual total cost of attendance, i...
616 Average cost of attendance (program-year insti... cost attendance.program_year integer COSTT4_P NaN NaN IPEDS ibid
617 In-state tuition and fees cost tuition.in_state integer TUITIONFEE_IN NaN NaN IPEDS NaN
618 Out-of-state tuition and fees cost tuition.out_of_state integer TUITIONFEE_OUT NaN NaN IPEDS NaN
620 Net tuition revenue per full-time equivalent s... school tuition_revenue_per_fte integer TUITFTE NaN NaN IPEDS Net tuition revenue (tuition revenue minus dis...
621 Instructional expenditures per full-time equiv... school instructional_expenditure_per_fte integer INEXPFTE NaN NaN IPEDS Instructional expenditures divided by the numb...
622 Average faculty salary school faculty_salary integer AVGFACSAL NaN NaN IPEDS Average faculty salary per month, calculated f...
633 Adjusted cohort count for completion rate at f... completion completion_cohort_4yr_150nt_pooled integer D150_4_POOLED NaN NaN IPEDS NaN
1661 Average family income of dependent students in... student avg_dependent_income.2014dollars integer DEP_INC_AVG NaN NaN NSLDS NaN
1662 Average family income of independent students ... student avg_independent_income.2014dollars integer IND_INC_AVG NaN NaN NSLDS NaN
1836 Average age of entry student demographics.age_entry integer AGE_ENTRY NaN NaN Treasury Data element describes the earnings cohort (wi...
1844 Average family income in real 2015 dollars student demographics.avg_family_income integer FAMINC NaN NaN NSLDS Data element describes the earnings cohort (wi...
1845 Median family income in real 2015 dollars student demographics.median_family_income integer MD_FAMINC NaN NaN NSLDS Data element describes the earnings cohort (wi...
1846 Average family income for independent students... student demographics.avg_family_income_independents integer FAMINC_IND NaN NaN NSLDS Data element describes the earnings cohort (wi...
1856 Median household income student demographics.median_hh_income integer MEDIAN_HH_INC NaN NaN Treasury Data element describes the earnings cohort (wi...
1959 Institution name aliases school alias autocomplete ALIAS NaN NaN IPEDS NaN

So UNITID, OPEID, OPEID6, INSTNM, INSTURL, NPCURL and ALIAS are features that have to do with the identity of the institution, so they don't add relevant information to the problem, therfore they will be eliminated. (flag_e)

The ZIP code could be useful if it is used to group the schools to some sort of category about it's location. We are not going to to this so we are going to eliminate it as well.


In [22]:
all_data_no_id_cols = all_data_no_na_columns.drop(['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'INSTURL', 'NPCURL', 'ALIAS', 'ZIP'], axis = 1)

In [23]:
all_data_no_id_cols.head()


Out[23]:
CITY STABBR ACCREDAGENCY SCH_DEG HCM2 MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ... D100_L4 TRANS_4 DTRANS_4 TRANS_L4 DTRANS_L4 ICLEVEL UGDS_MEN UGDS_WOMEN CDR3_DENOM CDR2_DENOM
0 Normal AL Southern Association of Colleges and Schools C... 3.0 0 1 1 3 4 1 ... NaN 0.0000 1044.0 NaN NaN 1 0.4831 0.5169 1895.0 1574.0
1 Birmingham AL Southern Association of Colleges and Schools C... 3.0 0 1 1 3 4 1 ... NaN 0.2408 1246.0 NaN NaN 1 0.4169 0.5831 3937.0 3481.0
2 Montgomery AL Southern Association of Colleges and Schools C... 3.0 0 1 1 3 4 2 ... NaN 0.0000 9.0 NaN NaN 1 0.3986 0.6014 344.0 336.0
3 Huntsville AL Southern Association of Colleges and Schools C... 3.0 0 1 1 3 4 1 ... NaN 0.3477 765.0 NaN NaN 1 0.5733 0.4267 1513.0 1392.0
4 Montgomery AL Southern Association of Colleges and Schools C... 3.0 0 1 1 3 4 1 ... NaN 0.0000 1330.0 NaN NaN 1 0.3877 0.6123 2137.0 1961.0

5 rows × 1725 columns

Work on the string and autocmplet data


In [24]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'string']


Out[24]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
5 State postcode school state string STABBR NaN NaN IPEDS Shown/used on consumer website.
7 Accreditor for institution school accreditor string ACCREDAGENCY NaN NaN FSA NaN
8 URL for institution's homepage school school_url string INSTURL NaN NaN IPEDS Shown/used on consumer website.
9 URL for institution's net price calculator school price_calculator_url string NPCURL NaN NaN IPEDS Shown/used on consumer website.

We already dropped INSTURL and NPCURL. Let's explore the STABBR feature


In [25]:
all_data_no_id_cols['STABBR']


Out[25]:
0       AL
1       AL
2       AL
3       AL
4       AL
5       AL
6       AL
7       AL
8       AL
9       AL
10      AL
11      AL
12      AL
13      AL
14      AL
15      AL
16      AL
17      AL
18      AL
19      AL
20      AL
21      AL
22      AL
23      AL
24      AL
25      AL
26      AL
27      AL
28      AL
29      AL
        ..
7673    NC
7674    NJ
7675    GA
7676    GA
7677    GA
7678    GA
7679    GA
7680    GA
7681    GA
7682    GA
7683    SC
7684    SC
7685    TX
7686    TX
7687    TX
7688    TX
7689    TX
7690    TX
7691    TX
7692    TX
7693    CA
7694    CA
7695    TX
7696    TX
7697    CA
7698    KS
7699    OH
7700    CA
7701    CA
7702    TX
Name: STABBR, Length: 7703, dtype: object

So this feature has to do with the state where the school is located. Let's explore the ACCREDAGENCY feature:


In [26]:
all_data_no_id_cols['ACCREDAGENCY']


Out[26]:
0       Southern Association of Colleges and Schools C...
1       Southern Association of Colleges and Schools C...
2       Southern Association of Colleges and Schools C...
3       Southern Association of Colleges and Schools C...
4       Southern Association of Colleges and Schools C...
5       Southern Association of Colleges and Schools C...
6       Southern Association of Colleges and Schools C...
7       Southern Association of Colleges and Schools C...
8       Southern Association of Colleges and Schools C...
9       Southern Association of Colleges and Schools C...
10      Southern Association of Colleges and Schools C...
11      Southern Association of Colleges and Schools C...
12      Southern Association of Colleges and Schools C...
13      Southern Association of Colleges and Schools C...
14      Southern Association of Colleges and Schools C...
15      Southern Association of Colleges and Schools C...
16      Southern Association of Colleges and Schools C...
17      Southern Association of Colleges and Schools C...
18      National Accrediting Commission of Cosmetology...
19      Southern Association of Colleges and Schools C...
20      Southern Association of Colleges and Schools C...
21      Southern Association of Colleges and Schools C...
22      North Central Association of Colleges and Scho...
23      Southern Association of Colleges and Schools C...
24               Association for Bibical Higher Educaiton
25      Southern Association of Colleges and Schools C...
26      Southern Association of Colleges and Schools C...
27      Southern Association of Colleges and Schools C...
28      Southern Association of Colleges and Schools C...
29      Southern Association of Colleges and Schools C...
                              ...                        
7673                                                  NaN
7674                    Council on Occupational Education
7675         Middle States Commission on Higher Education
7676         Middle States Commission on Higher Education
7677         Middle States Commission on Higher Education
7678         Middle States Commission on Higher Education
7679         Middle States Commission on Higher Education
7680         Middle States Commission on Higher Education
7681         Middle States Commission on Higher Education
7682                                                  NaN
7683         Middle States Commission on Higher Education
7684         Middle States Commission on Higher Education
7685         Middle States Commission on Higher Education
7686         Middle States Commission on Higher Education
7687         Middle States Commission on Higher Education
7688         Middle States Commission on Higher Education
7689         Middle States Commission on Higher Education
7690         Middle States Commission on Higher Education
7691         Middle States Commission on Higher Education
7692         Middle States Commission on Higher Education
7693    Accrediting Commission of Career Schools and C...
7694       Accrediting Bureau of Health Education Schools
7695                    Council on Occupational Education
7696                    Council on Occupational Education
7697    Accrediting Council for Independent Colleges a...
7698    North Central Association of Colleges and Scho...
7699    Accrediting Council for Continuing Education &...
7700    Accrediting Council for Independent Colleges a...
7701    Accrediting Council for Continuing Education &...
7702                    Council on Occupational Education
Name: ACCREDAGENCY, Length: 7703, dtype: object

In [27]:
all_data_no_id_cols['ACCREDAGENCY'].value_counts()


Out[27]:
North Central Association of Colleges and Schools The Higher Learning Commission                                                                      1218
National Accrediting Commission of Cosmetology Arts and Sciences                                                                                      1014
Southern Association of Colleges and Schools Commission on Colleges                                                                                    859
Middle States Commission on Higher Education                                                                                                           656
Accrediting Commission of Career Schools and Colleges                                                                                                  555
Accrediting Council for Independent Colleges and Schools                                                                                               476
Council on Occupational Education                                                                                                                      472
New England Association of Schools and Colleges Commission on Institutions of Higher Education                                                         235
Western Association of Schools and Colleges Senior Colleges and University Commission                                                                  208
Northwest Commission on Colleges and Universities                                                                                                      174
Accrediting Bureau of Health Education Schools                                                                                                         169
Western Association of Schools and Colleges Accrediting Commission for Community and Junior Colleges                                                   159
Accrediting Council for Continuing Education & Training                                                                                                103
Commission on Accrediting of the Association of Theological Schools                                                                                     76
Association of Advanced Rabbinical and Talmudic Schools                                                                                                 62
Association for Bibical Higher Educaiton                                                                                                                49
Oklahoma State Board of Career and Technology Education                                                                                                 45
Transnational Association of Christian Colleges and Schools                                                                                             41
Accreditation Commission for Education in Nursing Inc.                                                                                                  40
Commission on Massage Therapy Accreditation                                                                                                             30
Accrediting Commission for Acupuncture and Oriental Medicine                                                                                            29
American Bar Association                                                                                                                                20
Middle States Commission on Secondary Education                                                                                                         15
New York State Board of Regents State Education Department Office of the Professions (Public Postsecondary Vocational Education Practical Nursing)      14
New York State Board of Regents and the Commissioner of Education                                                                                       13
Distance Education Accrediting Commission                                                                                                               12
National Association of Schools of Arts and Design                                                                                                       9
Pennsylvania State Board of Vocational Education Bureau of Career and Technical Education                                                                7
National Association of Schools of Theatre                                                                                                               7
Joint Review Committee on Education in Radiologic Technology                                                                                             7
American Board of Funeral Service Education Committee on Accreditation                                                                                   7
National Association of Schools of Music                                                                                                                 6
Puerto Rico State Agency for the Approval of Public Postsecondary Vocational Technical Institutions and Programs                                         4
Midwifery Education Accreditation Council                                                                                                                4
American Osteopathic Association                                                                                                                         3
Montessori Accreditation Council for Teacher Education                                                                                                   3
National Association of Schools of Dance                                                                                                                 2
Council on Accreditation of Nurse Anesthesia Educational Programs                                                                                        1
American Podiatric Medical Association                                                                                                                   1
Name: ACCREDAGENCY, dtype: int64

Now les's explore the autocomplete data type:


In [28]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']


Out[28]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
3 Institution name school name autocomplete INSTNM NaN NaN IPEDS Shown/used on consumer website.
4 City school city autocomplete CITY NaN NaN IPEDS Shown/used on consumer website.
1959 Institution name aliases school alias autocomplete ALIAS NaN NaN IPEDS NaN

INSTNM and ALIAS where dropped, let's see the CITY feature:


In [29]:
all_data_no_id_cols['CITY']


Out[29]:
0                 Normal
1             Birmingham
2             Montgomery
3             Huntsville
4             Montgomery
5             Tuscaloosa
6         Alexander City
7                 Athens
8             Montgomery
9                 Auburn
10            Birmingham
11           Phenix City
12                 Selma
13            Montgomery
14            Enterprise
15           Bay Minette
16            Montgomery
17               Gadsden
18           Albertville
19                Dothan
20            Hanceville
21                 Selma
22            Birmingham
23            Montgomery
24              Florence
25            Huntsville
26          Jacksonville
27               Brewton
28            Birmingham
29                Tanner
              ...       
7673           Charlotte
7674         Jersey City
7675             Atlanta
7676              Morrow
7677             Roswell
7678        Douglasville
7679            Lithonia
7680            Savannah
7681             Augusta
7682            Columbus
7683            Columbia
7684    North Charleston
7685              Irving
7686             Houston
7687             Houston
7688               Plano
7689          Cedar Hill
7690              Dallas
7691         San Antonio
7692            Stafford
7693             Fremont
7694              Merced
7695             El Paso
7696              Austin
7697          Emeryville
7698       Overland Park
7699    Highland Heights
7700            San Jose
7701           Lancaster
7702         San Antonio
Name: CITY, Length: 7703, dtype: object

So STABBR, ACCREDAGENCY and CITY are features that we are going to keep, but they need to be transformed to an ordinal (using numbers) representation, since the ML algorithms use numbers and not strings.


In [30]:
all_data_no_strings = all_data_no_id_cols.copy()

#STABBR mapping
values = all_data_no_strings['STABBR'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['STABBR'] = all_data_no_strings['STABBR'].map(mapping)

#ACCREDAGENCY mapping
values = all_data_no_id_cols['ACCREDAGENCY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['ACCREDAGENCY'] = all_data_no_strings['ACCREDAGENCY'].map(mapping)

#CITY mapping
values = all_data_no_id_cols['CITY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['CITY'] = all_data_no_strings['CITY'].map(mapping)

all_data_no_strings.head()


Out[30]:
CITY STABBR ACCREDAGENCY SCH_DEG HCM2 MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ... D100_L4 TRANS_4 DTRANS_4 TRANS_L4 DTRANS_L4 ICLEVEL UGDS_MEN UGDS_WOMEN CDR3_DENOM CDR2_DENOM
0 1 1 1 3.0 0 1 1 3 4 1 ... NaN 0.0000 1044.0 NaN NaN 1 0.4831 0.5169 1895.0 1574.0
1 2 1 1 3.0 0 1 1 3 4 1 ... NaN 0.2408 1246.0 NaN NaN 1 0.4169 0.5831 3937.0 3481.0
2 3 1 1 3.0 0 1 1 3 4 2 ... NaN 0.0000 9.0 NaN NaN 1 0.3986 0.6014 344.0 336.0
3 4 1 1 3.0 0 1 1 3 4 1 ... NaN 0.3477 765.0 NaN NaN 1 0.5733 0.4267 1513.0 1392.0
4 3 1 1 3.0 0 1 1 3 4 1 ... NaN 0.0000 1330.0 NaN NaN 1 0.3877 0.6123 2137.0 1961.0

5 rows × 1725 columns

Let's see how our data looks so far


In [31]:
all_data_no_strings.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1725 entries, CITY to CDR2_DENOM
dtypes: float64(433), int64(13), object(1279)
memory usage: 101.4+ MB

Although we mapped or eliminated the string features, we still have a lot object (not numeric) data types. Let's work on them

Fetures with object dtype

Let's try to find a sample of features that should be numbers, but for some reason in the data they are not numbers


In [32]:
regex = re.compile('[0-9]+(\.[0-9]+)?$')
words = []
for column in all_data_no_strings:
    if all_data_no_strings[column].dtypes == 'object':
        for data in all_data_no_strings[column]:
            if not regex.match(str(data)):
                words.append(data)

In [33]:
pd.Series(words).value_counts()


Out[33]:
PrivacySuppressed    5074226
12/31/2999              7613
03/24/2014                93
05/09/2014                89
05/16/2014                88
02/26/2014                85
05/08/2014                61
05/15/2014                57
06/29/2014                56
05/02/2014                51
07/09/2014                49
06/12/2014                47
05/14/2014                45
06/25/2014                45
06/11/2014                44
06/13/2014                44
05/13/2014                43
05/12/2014                43
07/27/2014                41
05/10/2014                40
05/22/2014                37
06/20/2014                34
07/07/2014                34
05/19/2014                34
05/23/2014                33
05/11/2014                31
06/26/2014                30
05/07/2014                30
06/24/2014                29
06/02/2014                29
                      ...   
11/18/2013                 1
08/08/2013                 1
02/16/2014                 1
07/24/2013                 1
09/24/2013                 1
12/23/2013                 1
07/20/2013                 1
08/09/2013                 1
12/04/2013                 1
05/02/2013                 1
11/21/2013                 1
12/11/2013                 1
02/22/2014                 1
09/01/2014                 1
12/14/2013                 1
01/10/2014                 1
07/28/2013                 1
11/03/2013                 1
10/25/2013                 1
12/02/2013                 1
12/27/2013                 1
10/31/2013                 1
09/11/2013                 1
03/23/2014                 1
08/19/2013                 1
05/29/2013                 1
12/30/2013                 1
11/08/2013                 1
05/07/2013                 1
02/02/2014                 1
Length: 312, dtype: int64

We can see that there is a lot of data suppresed for privacy reasons. Also, there are dates, and one of them 12/31/2999 seems to be invalid. Let's go ahead and replace these values with nan, so we will treat it as any nan value. Also, if any column ends having all of its values as Nan, we will delete this column.


In [34]:
all_data_replaced_with_nan = all_data_no_strings.replace(to_replace = 'PrivacySuppressed', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.replace(to_replace = '12/31/2999', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.dropna(axis=1, how='all')

In [35]:
all_data_replaced_with_nan.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1714 entries, CITY to CDR2_DENOM
dtypes: float64(433), int64(13), object(1268)
memory usage: 100.7+ MB

Lets find wich features are date features


In [36]:
features_with_date = []
for column in all_data_replaced_with_nan:
    if all_data_replaced_with_nan[column].dtypes == 'object':
        if all_data_replaced_with_nan[column].str.match('[0-9]{2}/[0-9]{2}/[0-9]{4}').any():
            features_with_date.append(column)

In [37]:
features_with_date


Out[37]:
['SEPAR_DT_MDN']

In [38]:
data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == 'SEPAR_DT_MDN']


Out[38]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
1774 Median Date Student Separated completion separation_date.median integer SEPAR_DT_MDN NaN NaN NSLDS NOTE: Students who had not yet separated by Se...

It seems that SEPAR_DT_MDN don't add valuable information to the problem, so we are going to drop it


In [39]:
all_data_no_dates = all_data_replaced_with_nan.drop(['SEPAR_DT_MDN'], axis = 1)

Now we will transfore all the object features to numeric


In [40]:
all_data_no_objects = all_data_no_dates.copy()
for feature in all_data_no_dates:
    if all_data_no_dates[feature].dtypes == 'object':
        #Make all data numeric
        all_data_no_objects[feature] = pd.to_numeric(all_data_no_dates[feature])

In [41]:
all_data_no_objects.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1713 entries, CITY to CDR2_DENOM
dtypes: float64(1700), int64(13)
memory usage: 100.7 MB

Now we have gotten rid of the object dtype

Eliminate features with high number of NaN values

We already deleted features with that had all of their value as NaN, but now we will eliminate features with a high percentage of NaN values (more than 90%)


In [42]:
high_nan_features = []
for feature in all_data_no_objects:
    size = all_data_no_objects[feature].size
    number_of_valid = all_data_no_objects[feature].count()
    number_of_nan =  size - number_of_valid
    ratio = number_of_nan / size
    if ratio > 0.9:
        high_nan_features.append(feature)
print (len(high_nan_features))


262

In [43]:
all_data_no_high_nan = all_data_no_objects.drop(high_nan_features, axis = 1)

In [44]:
all_data_no_high_nan.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1451 entries, CITY to CDR2_DENOM
dtypes: float64(1438), int64(13)
memory usage: 85.3 MB

Filling missing data

We need to fill the mising data. To do this we need to know if the feature is numeric or categorical. Let's use the dictionary to get that info.


In [45]:
data_dict[15:25]


Out[45]:
NAME OF DATA ELEMENT dev-category developer-friendly name API data type VARIABLE NAME VALUE LABEL SOURCE NOTES
15 Predominant undergraduate degree awarded\n 0 N... school degrees_awarded.predominant integer PREDDEG 0.0 Not classified IPEDS Shown/used on consumer website.
16 NaN school NaN NaN NaN 1.0 Predominantly certificate-degree granting NaN NaN
17 NaN school NaN NaN NaN 2.0 Predominantly associate's-degree granting NaN NaN
18 NaN school NaN NaN NaN 3.0 Predominantly bachelor's-degree granting NaN NaN
19 NaN school NaN NaN NaN 4.0 Entirely graduate-degree granting NaN NaN
20 Highest degree awarded\n 0 Non-degree-granting... school degrees_awarded.highest integer HIGHDEG 0.0 Non-degree-granting IPEDS Shown/used on consumer website.
21 NaN school NaN NaN NaN 1.0 Certificate degree NaN NaN
22 NaN school NaN NaN NaN 2.0 Associate degree NaN NaN
23 NaN school NaN NaN NaN 3.0 Bachelor's degree NaN NaN
24 NaN school NaN NaN NaN 4.0 Graduate degree NaN NaN

We can see that after the name of a categorical feature, there is at least one item with value NaN. Let's use this to get a list of categorical features


In [46]:
categorical_features = []
is_null = data_dict['NAME OF DATA ELEMENT'].isnull()
for i in range(len(is_null) - 1):
    if not is_null[i] and is_null[i+1]:
        categorical_features.append(data_dict['VARIABLE NAME'][i])

To fill the missing data that belongs to a categorical feature, we will use the most common value of the data (mode). To fill the missing data that belongs to a numeric feature, we will use the the average of the data (mean).


In [47]:
all_data_no_nan = all_data_no_high_nan.copy()
for feature in all_data_no_high_nan:
    if feature in categorical_features:
        mode = all_data_no_high_nan[feature].mode()[0]        
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mode)
    else:
        mean = all_data_no_high_nan[feature].mean()
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mean)

In [48]:
all_data_no_nan.head()


Out[48]:
CITY STABBR ACCREDAGENCY SCH_DEG HCM2 MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ... D100_L4 TRANS_4 DTRANS_4 TRANS_L4 DTRANS_L4 ICLEVEL UGDS_MEN UGDS_WOMEN CDR3_DENOM CDR2_DENOM
0 1 1 1 3.0 0 1 1 3 4 1 ... 308.762923 0.0000 1044.0 0.046848 308.666836 1 0.4831 0.5169 1895.0 1574.0
1 2 1 1 3.0 0 1 1 3 4 1 ... 308.762923 0.2408 1246.0 0.046848 308.666836 1 0.4169 0.5831 3937.0 3481.0
2 3 1 1 3.0 0 1 1 3 4 2 ... 308.762923 0.0000 9.0 0.046848 308.666836 1 0.3986 0.6014 344.0 336.0
3 4 1 1 3.0 0 1 1 3 4 1 ... 308.762923 0.3477 765.0 0.046848 308.666836 1 0.5733 0.4267 1513.0 1392.0
4 3 1 1 3.0 0 1 1 3 4 1 ... 308.762923 0.0000 1330.0 0.046848 308.666836 1 0.3877 0.6123 2137.0 1961.0

5 rows × 1451 columns


In [49]:
all_data_no_nan.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7703 entries, 0 to 7702
Columns: 1451 entries, CITY to CDR2_DENOM
dtypes: float64(1438), int64(13)
memory usage: 85.3 MB

Let's save the data in a file


In [50]:
all_data_no_nan.to_csv('datasets/CollegeScorecardDataCleaned.csv', index = False)

In [ ]: